Importing libraries

Load the data

Data preparation

Solving missing data

We have the PREFERRED_RESTAURANT_TYPES column with a lot of missing data.

Other features also have missing data but the reason is because PURCHASE_COUNT is 0. Due to users have not ordered.

My approach for solving the missing data are:

  1. Keep the missing data related to PURCHASE_COUNT. Since it is a part of our USER_ID
  2. Checking the PREFERRED_RESTAURANT_TYPES feature then decide to keep it or drop it.

The 'PREFERRED_RESTAURANT_TYPES' column has a lot of missing data present as NaN, float type. The rest are present in string type like a list.

I do not transform to new columns since the element in list is different for each USER_ID. My decision is to remove the 'PREFERRED_RESTAURANT_TYPES' column

Normalized 'STORE_TYPE' column

Exploratory

The purposes of this part are:

  1. Getting myself to familiar with the data
  2. Provide a background info to answer the main task which is finding the user segmentation

Column-wise group

After dividing columns into group, we have 7 groups with some columns in others

REGISTRATION PURCHASE_QUANTITY MEAL_TYPE PURCHASE_VALUE PURCHASE_PLATFORM PURCHASE_TIME PURCHASE_VENUE OTHER
REGISTRATION_DATE PURCHASE_COUNT BREAKFAST_PURCHASES TOTAL_PURCHASES_EUR IOS_PURCHASES MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE General merchandise DISTINCT_PURCHASE_VENUE_COUNT
REGISTRATION_COUNTRY PURCHASE_COUNT_DELIVERY LUNCH_PURCHASES MIN_PURCHASE_VALUE_EUR WEB_PURCHASES MOST_COMMON_WEEKDAY_TO_PURCHASE Grocery AVERAGE_DELIVERY_DISTANCE_KMS
USER_ID PURCHASE_COUNT_TAKEAWAY EVENING_PURCHASES MAX_PURCHASE_VALUE_EUR ANDROID_PURCHASES AVG_DAYS_BETWEEN_PURCHASES Pet supplies
FIRST_PURCHASE_DAY DINNER_PURCHASES AVG_PURCHASE_VALUE_EUR MEDIAN_DAYS_BETWEEN_PURCHASES Restaurant
LATE_NIGHT_PURCHASES Retail store

Registration

Purchase_Quantity

Purchase_Meal_Type

Purchase_Value

Purchase_Platform

Purchase_Time

Purchase_Venue

Avg delivery distance & distinct venue

Correlation table between features

Row-wise group (by country)

Registration by top countries during Sep2019

Number of purchase & no purchase by top countries

Purchase_Meal_Type by top countries

Total purchase price & quantity by top countries

Purchase flatform by top countries

Purchase time by top countries

Purchase venue by top countries

Avg delivery distance & distinct venue by top countries

Exploratory Summary

1. REGISTRATION

2. PURCHASE QUANTITY

3. PURCHASE MEAL TYPE

4. PURCHASE PRICE BREAKDOWN

5. PURCHASE PLATFORM

6. PURCHASE TIME

7. PURCHASE VENUE

8. PURCHASE AVG DISTANCE & DISTINCT VENUE

9. CORRELATION

User segmentation

Criteria for a user segmentation

Criteria 1: Relevant to Wolt business

Wolt business model is sharing economy that mean that the success of Wolt significantly depend on external stakeholders:

  1. Wolt's suppliers: Restaurants, groceries, retails,...
  2. Wolt's logistic: Couriers
  3. Wolts' users: People who using wolt’s flatform to purchase items from wolt's suppliers

So these 3 stakeholders are doing great and sustainable, Wolt' business will growth

In order to find the indicator that could become a segmentation and help Wolt align with the success of these stakeholders. The target is to find one indicator, so I will follow the steps as below:

  1. Divide columns into each stakeholder based on the information that are relevant to them.
  2. Select columns that are most relevant for all stakeholders.
  3. If there is more than one column relevant to all stakeholders, i will move them to Criteria 2 & 3 and eliminate until only one left
SUPPLIERS COURIERS USERS
REGISTRATION_COUNTRY X
PURCHASE_COUNT X X X
PURCHASE_COUNT_DELIVERY X X X
PURCHASE_COUNT_TAKEAWAY X X
FIRST_PURCHASE_DAY X X
LAST_PURCHASE_DAY X X
USER_ID X
BREAKFAST_PURCHASES X X
LUNCH_PURCHASES X X
EVENING_PURCHASES X X
DINNER_PURCHASES X X
TOTAL_PURCHASES_EUR X X
DISTINCT_PURCHASE_VENUE_COUNT X X
MIN_PURCHASE_VALUE_EUR X X
MAX_PURCHASE_VALUE_EUR X X
AVG_PURCHASE_VALUE_EUR X X
PREFERRED_DEVICE X X
IOS_PURCHASES X X
WEB_PURCHASES X X
ANDROID_PURCHASES X X
USER_HAS_VALID_PAYMENT_METHOD X X
MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE X X X
MOST_COMMON_WEEKDAY_TO_PURCHASE X X X
AVG_DAYS_BETWEEN_PURCHASES X X X
MEDIAN_DAYS_BETWEEN_PURCHASES X X X
AVERAGE_DELIVERY_DISTANCE_KMS X X
General merchandise X X
Grocery X X
Pet supplies X X
Restaurant X X
Retail store X X

Based on the table above, we have 6 columns that are relevant to all stakeholders:

  1. PURCHASE_COUNT
  2. PURCHASE_COUNT_DELIVERY
  3. MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE
  4. MOST_COMMON_WEEKDAY_TO_PURCHASE
  5. AVG_DAYS_BETWEEN_PURCHASES
  6. MEDIAN_DAYS_BETWEEN_PURCHASES

Between PURCHASE_COUNT & PURCHASE_COUNT_DELIVERY, I select PURCHASE_COUNT because it included with takeaway option that benefit our users and suppliers.

MOST_COMMON_HOUR_OF_THE_DAY_TO_PURCHASE & MOST_COMMON_WEEKDAY_TO_PURCHASE also are interesting to know. For suppliers, they can be well prepare for that specific time. For couriers, they will be ready and target that working hours. However these data has a perfect normal distribution which do not give much any variable insight for segmentation.

Between AVG_DAYS_BETWEEN_PURCHASES & MEDIAN_DAYS_BETWEEN_PURCHASES, from the exploratory part we knew that the data of days between purchase are highly skewed. In this case, MEDIAN_DAYS_BETWEEN_PURCHASES is selected to reduce the affect of outliers.

Finally, we have two columns could be used to segment our users are PURCHASE_COUNT & MEDIAN_DAYS_BETWEEN_PURCHASES

Criteria 2: Stable & long-term approach

When choosing an indicator for segmentation, it should be stable and giving a long-term approach for the company. The segmentation should not easy to change regulary by other factor like business activities, promotions, etc.

Criteria 3: Easy to communicate, access & measure

Although we have selected MEDIAN_DAYS_BETWEEN_PURCHASES to segment our users, there is an option to invert the data, make it into a frequency from 0 to 1

I decided to keep the data as they are (by day) because:

  1. Easy to communicate a day time than a frequency between 0 to 1 to other stakeholders.
  2. Easy to integrate with business targets & metrics. Many of them are based on day, month, year time
  3. Easy to measure and track since other stakeholders can understand and integrate their activities together with it.

Creating user segmentation

Approach steps

From the boxplot and describe(), we knew that the distribution of median_days is right skewed. Also we have the following info:

  1. Median: 25 days
  2. Mean: 55.7 days
  3. Standard deviation: 75 days
  4. Min: 0 day
  5. Max: 421 day

We know that to have the data for median_days, an user will need to have at least 2 purchases. Based on this, we already can segment our first and second user group:

Segment 1: Users who registered but have not purchased, so Purchase_count = 0\ Segment 2: Users who have purchase_count = 1

Since the standard deviation of Median days is very large, i decide to reduce the outliers of the right tail with 1 standard deviation. This mean 75(std) + 55.7(mean) = 130.7 days as the point to remove outliers. However, i think this number is not so easy to remember, so i take the number 120 days instead as it is also 3 months time. 3 month is an appropriate and reasonable to call these users as rarely purchase users.

Segment 3: User who has median_days_between_purchase > 120 days\

With the median at 25 days, again i take a day that close to that and easy to memorize is 30 days. This number will split our last two user groups:

Segment 4: User who has median_days_between_purchase > 30 days & <= 120 days\ Segment 5: User who has median_days_between_purchase >= 0 days & <= 30 days

User segment

We have the user segmentation as follow:

  1. Frequent_Purchase : ~20%
  2. Occational_Purchase: ~11%
  3. Rarely_Purchase: ~5%
  4. One_Purchase: 19%
  5. No_Purchase: ~45%

Recommendation

Approach

PURPOSE: How to use the segmentation to reactivate different kinds of users for marketing team.

ACTION: We want to move users in other segments up until they become Frequent_Purchase users. My approach is groupby our user segment with their behavior (column-wise) to find differences in each group. After that i also combine with the findings in Exploratory Summary to give a recommendation.

Groupby our user segments with their behavior to find differences:

The only notable differences are in PURCHASE_VENUE, DISTINCT_VENUE and DAYS_TO_FIRST_PURCHASE in each user group

Recommendation

  1. No Purchase users:
  1. One & Rarely Purchase:
  1. Occasional Purchase:
  1. Frequent Purchase:

Pros & Cons

PROS:

CONS: